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Abstract 



The number of species can be estimated by sampling individuals from a species assemblage. The problem of esti- 
mating generalized species accumulation curve is addressed in a nonparametric Poisson mixture model. A likelihood- 
based estimator is proposed and illustrated by real examples. 
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1 Introduction 

An important but difficult problem in ecological studies is estimating species richness, i.e., the number of species 
in an assemblage based on an incomplete survey (Colwell and Coddington 1994). The same problem also arises 
from various other scientific fields (Bunge and Fitzpatrick 1993). In the survey, individuals are selected from the 
species assemblage and their species identities are recognized. The species accumulation curve (SAC) is the plot 
of the expected number of species against the measure of sampling effort, which serves a variety of purposes in 
ecological studies such as comparison among species assemblages and prediction of expected number of new species 
(e.g., Hurlbert 1971; Colwell and Coddington 1994; Shen et al. 2003; Mao 2005). The estimand of a nonparametric 
species richness estimator is also often plotted against the measure of sampling effort, called a generalized SAC and 
and used like the usual SAC (Colwell and Coddington 1994). Although estimating the usual SAC has been extensively 
studied (e.g., Mao 2005), little investigation has been made to estimate generalized SACs. A computationally intensive 
randomization procedure is usually used by ecologists and conservation biologists. 

Consider a species assemblage consisting of s distinct species labeled by i = 1, 2, . . . , s. The sampling of 
individuals from species i is often modeled as a Poisson process with rate Ai over time t e [0, oo) (e.g., Efron and 
Thisted 1976; NoiTis and Pollock 1998; Mao 2004, 2005). Let Yi{t) be the number of individuals from species i 
during [0, t\. Conditioning on h{t) = J2i=i ^i(^)' the Yi{t) arise as a multinomial sample of size h{t) with index s 
and probabilities pi ~ \l Yli'j^i i^-S-' Chao 1984). When the rates are assumed to arise as a random sample 
from a mixing distribution Q — Ylu=i ^uSiju), where S{X) is a distribution degenerate at A, the Yi{t) become a 
random sample from a Poisson mixture (e.g., Mao 2004). 



Let nj{t) = J2i=i ^O^ii^) = j)' where /(•) is the indicator function. Let n{t) = {ni{t), n2{t), . . . ) and ip{t) = 
E{n{t)} ^ (<?!)i(0, 02 (i),...). where 




(1) 
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Let n+(t) be the number of observed species with expectation <^+(t), where 



oc 



oc 



A nonparametric estimator for the number of species s is a function G{n{t)) which estimates G{4>{t)), a parameter 
that approximates s. Note that n+{t) is such an estimator. Another example is the estimator in Chao (1984), 



When the sampling is stopped at i = to, one has a vector of observed counts n{to). We will consider the problem 
of estimating G{(j){t)) based on n(to)- The special case of estimating was considered by Good and Toulmin 

(1956), Efron and Thisted (1976), Shen et al. (2003) and Mao (2005). 

The problem can be reduced to estimating (pit). Good and Toulmin (1956) provided an estimator for (f>{t). The 
Good-Toulmin estimator usually behaves badly at i > 2to and often produces inadmissible values (e.g., negative 
values) for t E {to,2to]. We will develop a likelihood-based estimator, which competes with the Good-Toulmin 
estimator att E [0, 2to] as its smoothed version. The likelihood-based estimator is particularly useful when the Good- 
Toulmin estimator fails. Our approach is different from that in Norris and Pollock (1998) because we do not require 
an estimator for s, a parameter that is difficult to estimate. We will also show that the commonly used randomization 
procedure is unnecessary because it is a simulation-based approximation to an enumeration procedure which yields an 
estimator close to the Good-Toulmin estimator. 

The estimation methods are detailed in Section 2. Numeric studies are reported in Section 3. The proofs are 
provided in the Appendix. The R codes are available from the author on request. 



For notational convenience, we will assume that time is scaled such that to = 1. Therefore, the fuU UkeUhood 
Po{s, &) is given by 




2 Methods 
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po{s,e) = 



{5-n+(l)}!n' 




n,(l)! 



where g@ is a mixture of Poisson densities. 



9eij) = ^'^u exp(-7„)7i(j!) \ j = 0, 1, . . . . 



The Good-Toulmin estimator cpj (t) can be written as 




(2) 



This estimator can arise from the following identity 




(3) 
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when one estimate (f>x{i) — sge{x) by 

Let d = max{j : nj{l) > 0}. We can write (i>j{t) as 

d 



The last term of the series in @ dominates soon after t > 2, and (j>j (t) diverges to infinity or minus infinity as t 
increases, depending on whether d — j is even or odd. This might invite one to replace both s and 6 with their 
estimators in 4>j{t). For example, Norris and Pollock (1998) provided nonparametric likelihood estimators for s and 
O by a procedure that is computationally very expensive. 

Because s is difficult to estimate (e.g., Bunge and Fitzpatrick 1993), we will show that estimating 4){t) does not 
necessarily require an estimator for ,s. Note that po(s, 6) — Pi{s, 8)p2(0, "•+(!))> where pi(s, 9) is the binomial 
density of and P2(0, "+(!)) is the multinomial density of nil) given 



= 7 nuT^^'e -ge(0)} 



V2{Q,n+(l)) = =oo 7TTT I I \ -j 777: 



n,(l) 



We will reformulate ^2(0, "+(!)) by introducing Q = X]u=i '^m'^(7«)' where 

_ 7r„{l - exp(-7„)} 

-exp(-7»^)}' 

Let Jq be a mixture of zero-truncated Poisson densities, where 

Because it can be shown that /qQ) = ffe(j)/{l ^ 5e(0)} (e.g., Mao 2004), we can rewrite p2{Q,nj^{l)) as 
L((5, rt+(l)), where 

Proposition 1 For j = 1, 2, . . . , h, and h — 1, 2, ... , 

^j{t)^M^Wi{t,Q), (5) 

where 9j {t, Q) a functional of the mixing distribution Q, 

^1 {1 -exp(-7„)}j! 

The nonparametric maximum likelihood estimator (NPMLE) denoted by Q ~ Ylu=i '^uS{'ju) maximizes L(Q, n+(l)) 
(Lindsay 1983; Mao 2004). Because (1) estimates ^!>+ (1), from a likelihood-based estimator (j>j (t) for (pj (t) is 
given by 

^,{t)=n+{l)9,{t,Q). (6) 
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Note that 0j (t) is a smoothed version of </)j (t) in (|2j because 



fc=0 



J 



The fitted density fgix) is used to estimate fqix) and yield while the empirical density Jq{x) — nx{l)/n+{\) 
is used to estimate fgix) and yield (pj {t). 

The function G{(j){t)) can be estimated by G{(p{t)) and G{(p{t)). The estimator G(n(l)) is reproduced by 
G(0(1)) = G(n(l)). A bootstrap procedure is recommended for construction of confidence intervals for G{(j){t)): 
sampling from its estimated binomial density and sampling n*{l) from L{Q, n^(l)). A lower confidence limit 

for G{4){t)) is also a lower confidence limit for s when G{4){t)) is a lower bound to s, e.g., 4i+{t) and Gc{4'{t))- 

It is difficult to estimate 0i (t) reliably when t is relatively large. One reason is that, although 'ju > in Q for all 
u, the smallest support point (say 71) of Q might be close or identical to zero. When 71 ~ 0, it is easily shown that 



u=2 



{1 - exp(-7„)}jr 



When t is sufficiently large, (t) will increase approximately linearly but each (t) with j > 2 will approach zero. 
This fact explains the observation that is approximately linear for a large t (Mao 2005). The estimator G{c/){t)) 

might also be driven up to infinity as t increases. For example, if 71 = 0, then there is /3 > 2 with 7^ < 7„ for all 

u > 2 and u ^ (3, and 

lim ^^('^W) = n+(l)^^{l-exp(-7^)} 
t^oo exp(7;3t) 2a)/37^ 

i.e., Gc((/)(i)) increases approximately exponentially for a large t. However, our likelihood-based method can be useful 
for relatively small t (e.g., t E [1,3] with <o = 1. the range of t that serves practical purposes). 

Finally we turn to the multinomial model. Let Xi{h) be the number of individuals from species z in a sample of 
size h and mj{h) — J2i=i = j)- This means that Xi{h{t)) = Yi{t) and mj{h{t)) = nj{t). Note that 

E{m,{h)}^j^(%l{l-p,f-^. 
Let a — h{l) be the number of sampled individuals during [0, 1]. For h — 1,2, . . . , a, one has 



^ ^ fa — h\ f a 



fc=0 



(^)=E(;-j( -1,2,...,/., (8) 



which is based on the following identity (Good and Toulmin 1956) 

E{ra,{h)} = Q k^){kl 3) («)}' = 1, 2, . . . (9) 



fe=0 



In the ecology literature, a randomization procedure is usually used. It is an approximation to an enumeration 
procedure: taking all subsamples of size h, calculate mj (h) with j > h for each subsample and obtain their fhj (h). 
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Proposition 2 For j = 1, 2, . . . , h and h = 1, 2, 

a — h 



k + j\fa — k — j\fa 
h-J 



nik+jia). 



(10) 



Hurlbert (1971) found the analytic expression of m^{h) = X^i^i '^i(^)' 



a — h 



x = l 



a — h 



By comparing (|8j and ilO\ . it is clear that nij (h) = rhj {h) because 

{'V)rH-7) OC'^') _ {k + ma-k-j)W.{a-h)l 



ik+j) 



j\k\{h - j)\{a - k - h)\a\ 



Although the identity in (|3} holds for all t > 0, the identity in (|9} does not hold for h > a. One can obtain an 
approximation to E{mj [h)} as a function of those E{mj (a)} and develop a biased estimator for E{mj [h)}. 

Since nrn,{a) = rn,{l), we can write ffij{h) as 



min(a — 



a — b\ (a 
h-J 



(11) 



The number of sampled individuals during [0, h/a] is about h. We consider comparing the estimators (h/a) in 
and fhj{h) in (IIU . Clearly fhj{h) — (j)j{h/a) ~ when j > d. When j < d, write (j)j{h/a) — mj{h) = ei + £2, 
where 



6— min(a— /i+J,c^) + l 

min(a— /i.+j.d) 

E 



^^^(Vay{(«-/i)/a}'''"b(l) 
(Vay{(a-^)/a}''"^' 



T-r h — U T-r a — h — w 

a — u a — j — w 



nfc(l). 



Note that ei = when a — h + j > d. When a — h+j<d, h/ais close to one because d ^ a, which implies 
that ei « 0. By simple algebra, one can also find that £2 ~ 0. Conclude that mj{h) « (pjlh/a). When the rhj{h{t)) 
are used to estimate G{4>[t)), the resulting estimator will be close to G{(p{t)). For example, fh+{h) and (j)+{h/a) are 
close to one another (Brewer and Williamson 1994). 



3 A real example 

We consider a real example from Miller and Wiegert (1989) that concerns plant species in the central Appalachian 
region. This example was also investigated in Shen et al. (2003). There were n+{\) — 188 species identified from 
a ^ h{l) = 1008 individuals with n^{l) = 61, 35, 18, 12, 15, 4, 8, 4, 5, 5, 1, 2, 1, 2, 3, 2, 1, 2, 1, 1, 1, 1, 1, 1 and 1 at 
a; = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1 1, 12, 13, 14, 15, 16, 19, 20, 22, 29, 32, 40, 43, 48 and 67. 

The NPMLE Q is shown in Table[l] The estimates (t>j{t), 4>+{t) ^nd Gc{(t>{t)) are shown in Figures ^ and |2] 
We also compare (pjlh/a) and fhj{h) for 1 < h < a, and (i>.j{t) and 4'j{'t) for < t < 1. The results are shown in 
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Table|2l We also calculate maxo<t<i \cf>+{t) - <j)+{t)\ = 0.06 and niaxo<t<i \Gc{ct>{t))-Gc{${t))\ = 2.58. Note that 
4>j {h/a) and fhj (h) have little difference. The difference between (f>j (t) and 0j (t) comes from the difference between 
nx{l) and n+(l)/g(a:), e.g., nx{5) ~ 15 and n+(l)/g(5) — 10.4. Although 4'j{t) can be computed for i > 1, it 

becomes inadmissible even for some t < 2, e.g., 02(1-57) = —2.6 and 04(1.57) = —192.5, Gc{<f){l-57)) = —497.1. 
To construct lower confidence limits for Gc{4>{t)), we generate 400 bootstrap resamples. For example, the bootstrap 
95% lower confidence limits for Gc(0(t)) at i = 1, 1.2, 1.4, 1.6, 1.8 and 2.0 are 218.1, 220.6, 221.5,222.2, 222.4 and 
222.4 while the estimates Gc{(j){t)) ai-e 243.7, 248.7, 251.5, 252.9, 253.7 and 254.1. Note that an upper confidence 
limit at a relatively large t is usually noninformative. For example, the 95% upper confidence limits for Gc{4>{t)) at 
t = 2 and t — 3 are 81 1.8 and 2230.8 respectively, much larger than the corresponding lower confidence limits 222.4 
and 222.6. 

In order to evaluate the likelihood-based method, we consider simulation under various combinations of Q and 
s. We find that the distribution of 4)j{t) is right skewed when t > 2 and in particular, the distribution of 0i(t) has a 
long right tail for a large t, like (t)+{t) and Gc{4>{t)) although the 3rd quartile of Gc{4>{t)) increases faster than that of 
0+ {t) or 4>i [t). In the future, we will consider generalized SACs for various nonparametric estimators (e.g., Chao and 
Bunge 2002). 

Table 1: The NPMLE Q with £> = 7 from the plant data. 

% 0.864 3.554 7.412 15.306 30.564 41.892 66.416 
uju 0.475 0.260 0.158 0.074 0.010 0.017 0.005 



Table 2: Comparison of three types of estimates 0j(i), (t>j{t) and mj{h) with Aj = maxi</i<a \<j>j{h/a) — ■mj{h)\ 
and Dj = maxo<t<i \4>j{t) - 4>j{t)\- 
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4 


5 


6 


7 


8 


9 


10 


A, 


0.13 
0.44 


0.03 
1.14 


0.02 
1.01 


0.01 
1.35 


0.01 
4.56 


0.01 
4.26 


0.01 
1.44 


0.01 
1.21 


0.01 
1.03 


0.01 
1.78 
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Figure 1 : The likelihood-based estimates (l>j {t) of the expected counts 0j {t) for j = 1 (solid), 2 (dashed), 3 (dotted) 
and 4 (dot-dashed). 
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Appendix 

To prove Proposition 1, write 



^+(1) s - s EJl=i TTu; exp(-7^i) 

7r„{l - exp(-7„t)} exp(-7„t)(7tit)-' 



2 Y.w=i - exp(-7^i)} {1 - exp(-7„i)}j! ' 

To prove Proposition 2, let the individuals be labeled by j = 1,2, ... , oand Zij = /(individual j is from species i). 
A subsample oj consists of h individuals. Let be the set of all such subsamples. With (^) = if a < /3, write 

a)%w = EE^(E^-=j-)=E E E<E^-=i) 

a;6n i=l reui t=0 {i:Yj(a)=t} wGQ rQut 

a a—h+j a—h 

= E E Q (r •) = E Q (r>*(«) = E CV) (" 

t=0 {i:Yi(a)=t} t=j k=0 
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